Abstract

We study online boosting, the task of converting any weak online learner into a strong 
online learner. Based on a novel and natural definition of weak online learnability, we develop 
two online boosting algorithms. The first algorithm is an online version of boost-by-majority. By 
proving a matching lower bound, we show that this algorithm is essentially optimal in terms of 
the number of weak learners and the sample complexity needed to achieve a specified accuracy. 
This optimal algorithm is not adaptive, however. Using tools from online loss minimization, 
we derive an adaptive online boosting algorithm that is also parameter-free, but not optimal. 
Both algorithms work with base learners that can handle example importance weights directly, 
as well as by rejection sampling examples with probability defined by the booster. Results are 
complemented with an experimental study. 


1 Introduction 


We study online boosting, the task of boosting the accuracy of any weak online learning algorithm. 
The theory of boosting in the batch setting has been studied extensively in the literature and has led
to huge practical success. See the book by Schapire and Freund [2012] for a thorough discussion.


Online learning algorithms receive examples one by one, updating the predictor immediately 
after seeing each new example. In contrast to the batch setting, online learning algorithms typically 
don’t make any stochastic assumptions about the data they observe. They are often much faster, 
more memory-efficient, and apply to situations where the best predictor changes over time as new 
examples keep coming in. 

Given the success of boosting in batch learning, it is natural to ask about the possibility of 
From a theoretical viewpoint, recent work by Chen et al. [2012] is perhaps most interesting.
They generalized the batch weak learning assumption to the online setting, and made a connection
between online boosting and batch boosting that produces smooth distributions over the training
examples. The resulting algorithm is guaranteed to achieve an arbitrarily small error rate as long
as the number of weak learners and the number of examples are sufficiently large. No assumptions
need to be made about how the data is generated. Indeed, the data can even be generated by an
adversary.

Table 1: Comparisons of our results with those of Chen et al. [2012], assuming, as in their paper,
that the weak learner is derived from an online learning algorithm with an $O(\sqrt{T})$ regret bound.

We present a new online boosting algorithm, based on the boost-by-majority (BBM) algorithm
of Freund [1995]. This algorithm, called Online BBM, improves upon the work of Chen et al.
[2012] in several different aspects:

1. our assumption on online weak learners is weaker and can be seen as a direct online analogue 
of the weak learning assumption in standard batch boosting, 

2. our algorithm doesn’t require weighted online learning, instead using a sampling technique
similar to the one used in boosting by filtering in the batch setting [see for example, Freund,
1992, Bradley and Schapire, 2008], and

3. our algorithm is optimal in the sense that no online boosting algorithm can achieve the same
error rate with fewer weak learners or fewer examples asymptotically (see the lower bounds in
Section 3.2).

A quantitative comparison of our results with those of Chen et al. [2012] appears in Table 1, where
$N$ and $T$ represent the number¹ of weak learners and examples needed to achieve error rate $\epsilon$, and
$\gamma$ stands for a similar concept of the “edge” of the weak learning oracle as in the batch setting
(smaller $\gamma$ means more inaccurate weak learners).

A clear drawback of all the algorithms mentioned above is lack of adaptivity. A simple
interpretation of this drawback is that all these algorithms require using $\gamma$, an unknown quantity,
as a parameter. More importantly, this also means that the algorithm treats each weak learner
equally and ignores the fact that some weak learners are actually doing better than the others.
The best example of an adaptive boosting algorithm is the well-known parameter-free AdaBoost
algorithm [Freund and Schapire, 1997], where each weak learner is naturally weighted by how accurate
it is. In fact, adaptivity is known to be one of the key features that lead to the practical
success of AdaBoost, and therefore should also be essential to the performance of online boosting
algorithms. In Section 4 we thus propose AdaBoost.OL, an adaptive and parameter-free online
boosting algorithm. As shown in Table 1, AdaBoost.OL is theoretically suboptimal in terms of
$N$ and $T$. However, empirically it generally outperforms OSBoost and sometimes even beats the
optimal algorithm Online BBM (see Section 5).


Our techniques are also very different from those of Chen et al. [2012], which rely on the smooth
boosting algorithm of Servedio [2003]. As far as we know, all other work on smooth boosting
[Bshouty and Gavinsky, 2003, Bradley and Schapire, 2008, Barak et al., 2009] cannot be easily
generalized to the online setting, necessitating completely different methods not relying on smooth
distributions. Our Online BBM algorithm builds on top of a potential based family that arises
naturally in the batch setting as approximate minimax optimal algorithms for so-called drifting games
[Schapire, 2001, Luo and Schapire, 2014]. The decomposition of each example in that framework
naturally allows us to generalize it to the online setting where examples come one by one. On the
other hand, AdaBoost.OL is derived by viewing boosting from a different angle: loss minimization
[Mason et al., 2000, Schapire and Freund, 2012]. The theory of online loss minimization is the key
tool for developing AdaBoost.OL.

Finally, in Section 5, experiments on benchmark data are conducted to show that our new
algorithms indeed improve over previous work.

¹In this paper, we use the $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ notation to suppress dependence on polylogarithmic
factors in the natural parameters.


2 Setup and Assumptions 

We describe the formal setup of the task of online classification by boosting. At each time step
$t = 1, \ldots, T$, an adversary chooses an example $(x_t, y_t) \in \mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is the domain, and
reveals $x_t$ to the online learner. The learner makes a prediction $\hat{y}_t \in \{-1, 1\}$ of its label, and suffers
the 0-1 loss $\mathbf{1}\{\hat{y}_t \neq y_t\}$. As is usual with online algorithms, this prediction may be randomized.

For parameters $\gamma \in (0, \frac{1}{2})$, $\delta \in (0, 1)$, and a constant $S > 0$, the learner is said to be a weak
online learner with edge $\gamma$ and excess loss $S$ if, for any $T$ and for any input sequence of examples
$(x_t, y_t)$ for $t = 1, 2, \ldots, T$ chosen adaptively, it generates predictions $\hat{y}_t$ such that with probability
at least $1 - \delta$,

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} - \gamma\right) T + S. \tag{1}$$

The excess loss requirement is necessary since an online learner can’t be expected to predict with any
accuracy with too few examples. Essentially, the excess loss $S$ yields a kind of sample complexity
bound: the weak learner starts obtaining a distinct edge of $\Omega(\gamma)$ over random guessing when $T \gg \frac{S}{\gamma}$.
Typically, the dependence of the high probability bound on $\delta$ is polylogarithmic in $\frac{1}{\delta}$; thus in the
following we will avoid explicitly mentioning $\delta$.
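To make the role of the excess loss concrete, the following is a minimal sketch (not part of the original analysis) that computes, for a recorded run of a learner, the smallest $S$ for which inequality (1) holds at every horizon $T$. The function names and the example run are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def max_excess_loss(mistakes: Sequence[int], gamma: float) -> float:
    """Smallest S for which inequality (1) holds at every prefix length T.

    mistakes[t] is 1 if the learner erred in round t + 1 and 0 otherwise.
    """
    worst = 0.0
    errors = 0
    for T, m in enumerate(mistakes, start=1):
        errors += m
        # Inequality (1) rearranged: S >= errors - (1/2 - gamma) * T.
        worst = max(worst, errors - (0.5 - gamma) * T)
    return worst


def is_weak_learner_run(mistakes: Sequence[int], gamma: float, S: float) -> bool:
    """Check inequality (1) with edge gamma and excess loss S on one run."""
    return max_excess_loss(mistakes, gamma) <= S


# Hypothetical run: wrong on the first 10 rounds, then wrong once every 4 rounds.
run = [1] * 10 + [1, 0, 0, 0] * 50
```

For this run and $\gamma = 0.1$, the smallest admissible excess loss is 6.6, attained at $T = 11$: the learner needs a warm-up period before its edge over random guessing becomes visible.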

For a given parameter $\epsilon > 0$, the learner is said to be a strong online learner with error rate
$\epsilon$ if it satisfies the same conditions as a weak online learner except that its edge is $\frac{1}{2} - \epsilon$, or in
other words, the fraction of mistakes made, asymptotically, is $\epsilon$. Just as for the weak learner, the
excess loss $S$ yields a sample complexity bound: the fraction of mistakes made by the strong learner
becomes $O(\epsilon)$ when $T \gg \frac{S}{\epsilon}$.

Our main theorem is the following: 

Theorem 1. Given a weak online learning algorithm with edge $\gamma$ and excess loss $S$ and any target
error rate $\epsilon > 0$, there is a strong online learning algorithm with error rate $\epsilon$ which uses
$O(\frac{1}{\gamma^2} \ln(\frac{1}{\epsilon}))$ copies of the weak online learner, and has excess loss $\tilde{O}(\frac{S}{\gamma} + \frac{1}{\gamma^2})$; thus its sample
complexity is $\tilde{O}(\frac{1}{\epsilon}(\frac{S}{\gamma} + \frac{1}{\gamma^2}))$. Furthermore, if $S \ge \tilde{\Omega}(\frac{1}{\gamma})$, then the number of weak online learners
is optimal up to constant factors, and the sample complexity is optimal up to polylogarithmic factors.

The requirement that $S \ge \tilde{\Omega}(\frac{1}{\gamma})$ in the lower bound is not very stringent; this is precisely the
excess loss one obtains when using standard online learning algorithms with regret bound $O(\sqrt{T})$,
as is explained in the discussion following Lemma 2. Furthermore, since we require the bound (1)
to hold with high probability, typical analyses of online learning algorithms will have an $\tilde{O}(\sqrt{T})$
deviation term, which also leads to $S \ge \tilde{\Omega}(\frac{1}{\gamma})$.

As the theorem indicates, the strong online learner (hereafter referred to as “booster”) works
by maintaining $N$ copies of the weak online learner, for some positive integer $N$ to be specified
later. Denote the weak online learners $\mathrm{WL}^i$ for $i = 1, 2, \ldots, N$. At time step $t$, the prediction of
the $i$-th weak online learner is given by $\mathrm{WL}^i(x_t) \in \{-1, 1\}$. Note the slight abuse of notation here:
$\mathrm{WL}^i$ is not a function, rather it is an algorithm with an internal state that is updated as it is fed
training examples. Thus, the prediction $\mathrm{WL}^i(x_t)$ depends on the internal state of $\mathrm{WL}^i$, and for
notational convenience we avoid reference to the internal state.

In each round $t$, the booster works by taking a weighted majority vote of the weak learners’
predictions. Specifically, the booster maintains weights $\alpha_t^i \in \mathbb{R}$ for $i = 1, \ldots, N$ corresponding to
each weak learner, and its final prediction will then be $\hat{y}_t = \mathrm{sign}(\sum_{i=1}^{N} \alpha_t^i \mathrm{WL}^i(x_t))$,² where $\mathrm{sign}(\cdot)$
is 1 if the argument is nonnegative and $-1$ otherwise. After making the prediction, the true label
$y_t$ is revealed by the environment. The booster then updates $\mathrm{WL}^i$ by passing the training example
$(x_t, y_t)$ to $\mathrm{WL}^i$ with a carefully chosen sampling probability $p_t^i$ (and not passing the example with
the remaining probability). The sampling probability $p_t^i$ is obtained by computing a weight $w_t^i$ and
setting $p_t^i = \frac{w_t^i}{\|w^i\|_\infty}$,³ where $w^i = \langle w_1^i, w_2^i, \ldots, w_T^i \rangle$. At the same time the booster updates $\alpha_t^i$ as
well, and then it is ready to make a prediction for the next round.

We introduce some more notation to ease the presentation. Let $z_t^i = y_t \mathrm{WL}^i(x_t)$ and $s_t^i =
s_t^{i-1} + \alpha_t^i z_t^i$ with $s_t^0 = 0$. Define $z^i = \langle z_1^i, z_2^i, \ldots, z_T^i \rangle$. Finally, a martingale concentration bound
using (1) yields the following bound (proof deferred to Appendix A). The bound can be seen as a
weighted version of (1) which is necessary for the rest of the analysis.

Lemma 1. There is a constant $\tilde{S} = 2S + \tilde{O}(\frac{1}{\gamma})$ such that for any $T$, with high probability, for every
weak learner $\mathrm{WL}^i$ we have

$$w^i \cdot z^i \ge \gamma \|w^i\|_1 - \tilde{S} \|w^i\|_\infty.$$

2.1 Handling Importance Weights 

Typical online learning algorithms can handle importance weighted examples: each example $(x_t, y_t)$
comes with a weight $p_t \in [0, 1]$, and the loss on that example is scaled by $p_t$, i.e. the loss for
predicting $\hat{y}_t$ is $p_t \mathbf{1}\{\hat{y}_t \neq y_t\}$. Consider the following natural extension to the definition of online
weak learners which incorporates importance weighted examples: we now require that for any
sequence of weighted examples $(x_t, y_t)$ with weight $p_t \in [0, 1]$ for $t = 1, 2, \ldots, T$, the online learner
generates predictions $\hat{y}_t$ such that with probability at least $1 - \delta$,

$$\sum_{t=1}^{T} p_t \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} - \gamma\right) \sum_{t=1}^{T} p_t + S. \tag{2}$$

Having access to such a weak learner makes the boosting algorithm simpler: we now simply pass
every example $(x_t, y_t)$ to every weak learner $\mathrm{WL}^i$ using the probability $p_t^i = \frac{w_t^i}{\|w^i\|_\infty}$ as importance
weights. The advantage is that the bound (2) immediately implies the following inequality for any
weak learner $\mathrm{WL}^i$, which can be seen as a (stronger) analogue of Lemma 1:

$$w^i \cdot z^i \ge 2\gamma \|w^i\|_1 - 2S \|w^i\|_\infty. \tag{3}$$

Since the analysis depends only on the bound in Lemma 1, if we use the importance-weighted
version of the boosting algorithm, then we can simply use inequality (3) instead in the analysis,
which gives a slightly tighter version of Theorem 1, viz. the excess loss can now be bounded by
$\tilde{O}(\frac{S}{\gamma})$.

In the rest of the paper, for simplicity of exposition we assume that the $p_t^i$’s are used as sampling
probabilities rather than importance weights, and give the analysis using the bound from Lemma 1.
In experiments, however, using the $p_t^i$’s as importance weights rather than sampling probabilities
led to better performance.

²In Section 4 a slightly different final prediction will be used.

³In the algorithm we simply use a tight-enough upper bound on $\|w^i\|_\infty$ (such as the bound from Lemma 4) to
compute the values $p_t^i$; we abuse notation here and use $\|w^i\|_\infty$ to also denote this upper bound.


2.2 Discussion of Weak Online Learning Assumption 


We now justify our definition of weak online learning, viz. inequality (1). In the standard batch
boosting case, the corresponding weak learning assumption (see for example Schapire and Freund
[2012]) made is that there is an algorithm which, given a training set of examples and an arbitrary
distribution on it, generates a hypothesis that has error at most $\frac{1}{2} - \gamma$ on the training data under
the given distribution. This statement can be interpreted as making the following two implicit
assumptions:

1. (Richness.) Given an edge parameter $\gamma \in (0, \frac{1}{2})$, there is a set of hypotheses, $\mathcal{H}$, such that
given any training set (possibly, a multiset) of examples $U$, there is some hypothesis $h \in \mathcal{H}$
with error at most $\frac{1}{2} - \gamma$, i.e.

$$\sum_{(x,y) \in U} \mathbf{1}\{h(x) \neq y\} \le \left(\tfrac{1}{2} - \gamma\right) |U|.$$

2. (Agnostic Learnability.) For any $\epsilon \in (0, 1)$, there is an algorithm which, given any training
set (possibly, a multiset) of examples $U$, can compute a nearly optimal hypothesis $h \in \mathcal{H}$, i.e.

$$\sum_{(x,y) \in U} \mathbf{1}\{h(x) \neq y\} \le \inf_{h' \in \mathcal{H}} \sum_{(x,y) \in U} \mathbf{1}\{h'(x) \neq y\} + \epsilon |U|.$$


Our weak online learning assumption can be seen as arising from a direct generalization of the
above two assumptions to the online setting. Namely, the richness assumption stays the same,
whereas the agnostic learnability of $\mathcal{H}$ assumption is replaced by an agnostic online learnability of
$\mathcal{H}$ assumption (c.f. Ben-David et al. [2009]). I.e., there is an online learning algorithm which, given
any sequence of examples, $(x_t, y_t)$ for $t = 1, 2, \ldots, T$, generates predictions $\hat{y}_t$ such that

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \mathbf{1}\{h(x_t) \neq y_t\} + R(T),$$

where $R : \mathbb{N} \to \mathbb{R}_+$ is the regret, a non-decreasing, sublinear function of the number of
prediction periods $T$. Since online learning algorithms are typically randomized, we assume the above
bound holds with high probability. The following lemma shows that richness and agnostic online
learnability immediately imply our online weak learning assumption (1):

Lemma 2. Suppose the sequence of examples $(x_t, y_t)$ is obtained from a data set for which there
exists a hypothesis class $\mathcal{H}$ that is both rich for edge parameter $2\gamma$ and agnostically online learnable
with regret $R(\cdot)$. Then, the agnostic online learning algorithm for $\mathcal{H}$ satisfies the weak learning
assumption (1), with edge $\gamma$ and excess loss $S = \max_T (R(T) - \gamma T)$.

Proof. For the given sequence of examples $(x_t, y_t)$ for $t = 1, 2, \ldots, T$, the richness with edge
parameter $2\gamma$ and agnostic online learnability assumptions on $\mathcal{H}$ imply that with high probability, the
predictions $\hat{y}_t$ generated by the agnostic online learning algorithm for $\mathcal{H}$ satisfy

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} - 2\gamma\right) T + R(T).$$

It only remains to show that

$$\left(\tfrac{1}{2} - 2\gamma\right) T + R(T) \le \left(\tfrac{1}{2} - \gamma\right) T + S,$$

or equivalently, $R(T) \le \gamma T + S$, which is true by the definition of $S$. This concludes the proof. □

Various agnostic online learning algorithms are known that have a regret bound of $O(\sqrt{T \ln(\frac{1}{\delta})})$;
for example, a standard experts algorithm on a finite hypothesis space such as Hedge. If we use
such an online learning algorithm as a weak online learner, then a simple calculation implies, via
Lemma 2, that it has excess loss $O(\frac{\ln(1/\delta)}{\gamma})$. Thus, by Theorem 1, we obtain an online boosting
algorithm with near-optimal sample complexity.
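The "simple calculation" can be spelled out as follows. This is a sketch under the assumption that the regret has the explicit form $R(T) = c\sqrt{T L}$ with $L = \ln(1/\delta)$ and $c > 0$ a constant; the constant $c$ is introduced here only for illustration. Maximizing $R(T) - \gamma T$ over $T$, as in Lemma 2, gives:

```latex
% Assumption for illustration: R(T) = c \sqrt{T L}, with L = \ln(1/\delta) and c > 0 constant.
\begin{align*}
S &= \max_{T \ge 0} \left( c\sqrt{T L} - \gamma T \right), \\
0 &= \frac{c\sqrt{L}}{2\sqrt{T^\star}} - \gamma
   \quad\Longrightarrow\quad T^\star = \frac{c^2 L}{4\gamma^2}, \\
S &= c\sqrt{L}\cdot\frac{c\sqrt{L}}{2\gamma} - \gamma\cdot\frac{c^2 L}{4\gamma^2}
   = \frac{c^2 L}{2\gamma} - \frac{c^2 L}{4\gamma}
   = \frac{c^2 \ln(1/\delta)}{4\gamma}.
\end{align*}
```

The maximizer $T^\star$ is also the horizon at which the regret term stops dominating the edge term $\gamma T$.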


3 An Optimal Algorithm 

In this section, we generalize a family of potential based batch boosting algorithms to the online 
setting. With a specific potential, an online version of boost-by-majority is developed with optimal 
number of weak learners and near-optimal sample complexity. Matching lower bounds will be 
shown at the end of the section. 


3.1 A Potential Based Family and Boost-By-Majority 


In the batch setting, many boosting algorithms can be understood in a unified framework called
drifting games [Schapire, 2001]. Here, we generalize the analysis and propose a potential based
family of online boosting algorithms.

Pick a sequence of $N + 1$ non-increasing potential functions $\Phi_i(s)$ such that

$$\Phi_N(s) \ge \mathbf{1}\{s \le 0\}, \qquad \Phi_{i-1}(s) \ge \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) \Phi_i(s - 1) + \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right) \Phi_i(s + 1). \tag{4}$$

Then the algorithm is simply to set $\alpha_t^i = 1$ and $w_t^i = \frac{1}{2}(\Phi_i(s_t^{i-1} - 1) - \Phi_i(s_t^{i-1} + 1))$. The following
lemma states the error rate bound of this general scheme.

Lemma 3. For any $T$ and $N$, with high probability, the number of mistakes made by the algorithm
described above is bounded as follows:

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \Phi_0(0) T + \tilde{S} \sum_{i} \|w^i\|_\infty.$$


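Proof. The key property of the algorithm is that for any fixed $i$ and $t$, one can verify the following:

$$\Phi_i(s_t^i) + w_t^i (z_t^i - \gamma) = \Phi_i(s_t^{i-1} + z_t^i) + w_t^i (z_t^i - \gamma) = \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) \Phi_i(s_t^{i-1} - 1) + \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right) \Phi_i(s_t^{i-1} + 1) \le \Phi_{i-1}(s_t^{i-1})$$

by plugging the formula of $w_t^i$, realizing that $z_t^i$ can only be $-1$ or $1$, and using the definition of
$\Phi_{i-1}(s)$ from Eq. (4). Summing over $t = 1$ to $T$, we get

$$\sum_{t=1}^{T} \Phi_i(s_t^i) + w^i \cdot z^i - \gamma \|w^i\|_1 \le \sum_{t=1}^{T} \Phi_{i-1}(s_t^{i-1}).$$

Using Lemma 1 we get

$$\sum_{t=1}^{T} \Phi_i(s_t^i) \le \sum_{t=1}^{T} \Phi_{i-1}(s_t^{i-1}) + \tilde{S} \|w^i\|_\infty,$$

which relates the sums of all examples’ potential for two successive weak learners. We can therefore
apply this inequality iteratively to arrive at:

$$\sum_{t=1}^{T} \Phi_N(s_t^N) \le \sum_{t=1}^{T} \Phi_0(0) + \tilde{S} \sum_{i} \|w^i\|_\infty.$$

The proof is completed by noting that

$$\Phi_N(s_t^N) \ge \mathbf{1}\{s_t^N \le 0\} = \mathbf{1}\{\hat{y}_t \neq y_t\}$$

since $y_t \hat{y}_t = \mathrm{sign}(s_t^N)$ by definition. □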


Note that the $\tilde{S} \|w^i\|_\infty$ term becomes a penalty for the final error rate. Therefore, we naturally
want this penalty term to be relatively small. This is not necessarily true for any choice of the
potential function. For example, if $\Phi_i(s)$ is the exponential potential that leads to a variant of
AdaBoost in the batch setting [see Schapire and Freund, 2012, Chap. 13], then the weight $w_t^i$ could
be exponentially large.

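Fortunately, there is indeed a set of potential functions that produces small weights, which, in
the batch setting, corresponds to an algorithm called boost-by-majority (BBM) [Freund, 1995]. All
we need to do is to let Eq. (4) hold with equality, and direct calculation shows:

$$\Phi_i(s) = \sum_{k=0}^{\lfloor (N - i - s)/2 \rfloor} \binom{N - i}{k} \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)^{k} \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right)^{N - i - k}$$

and

$$w_t^i = \frac{1}{2} \binom{N - i}{k_t^i} \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)^{k_t^i} \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right)^{N - i - k_t^i}, \tag{5}$$

where $k_t^i = \lfloor \frac{N - i - s_t^{i-1} + 1}{2} \rfloor$ and $\binom{n}{k}$ is defined to be 0 if $k < 0$ or $k > n$. In other words, imagine
flipping a biased coin whose probability of heads is $\frac{1}{2} + \frac{\gamma}{2}$ for $N - i$ times. Then $\Phi_i(s)$ is exactly
the probability of seeing at most $(N - i - s)/2$ heads and $w_t^i$ is half of the probability of seeing $k_t^i$
heads. We call this algorithm Online BBM. The pseudocode is given in Algorithm 1.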
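The coin-flipping interpretation translates directly into code. The following is a minimal sketch, with hypothetical function names, that evaluates the BBM potential $\Phi_i(s)$ and the weight of Eq. (5) from binomial probabilities. It can be used to check numerically that Eq. (4) holds with equality and that $w_t^i = \frac{1}{2}(\Phi_i(s_t^{i-1} - 1) - \Phi_i(s_t^{i-1} + 1))$.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """P[Binomial(n, p) = k], defined as 0 when k is outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def potential(N: int, i: int, s: int, gamma: float) -> float:
    """Phi_i(s): probability of at most floor((N - i - s) / 2) heads in N - i flips."""
    n = N - i
    p_heads = 0.5 + gamma / 2.0
    k_max = (n - s) // 2  # integer floor division, also correct for negative values
    return sum((binom_pmf(n, k, p_heads) for k in range(0, k_max + 1)), 0.0)


def weight(N: int, i: int, s_prev: int, gamma: float) -> float:
    """w_t^i from Eq. (5), where s_prev plays the role of s_t^{i-1}."""
    n = N - i
    k = (n - s_prev + 1) // 2
    return 0.5 * binom_pmf(n, k, 0.5 + gamma / 2.0)
```

Since a binomial probability never exceeds 1, the weight returned by `weight` is at most 1/2 for every input.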

One can see that the weights produced by this algorithm are small since trivially $w_t^i \le 1/2$. To
get a better result, however, we need a better estimate of $\|w^i\|_\infty$ stated in the following lemma.

Algorithm 1 Online BBM
 1: for $t = 1$ to $T$ do
 2:   Receive example $x_t$.
 3:   Predict $\hat{y}_t = \mathrm{sign}(\sum_{i=1}^{N} \mathrm{WL}^i(x_t))$, receive label $y_t$.
 4:   Set $s_t^0 = 0$.
 5:   for $i = 1$ to $N$ do
 6:     Set $s_t^i = s_t^{i-1} + y_t \mathrm{WL}^i(x_t)$.
 7:     Set $k_t^i = \lfloor \frac{N - i - s_t^{i-1} + 1}{2} \rfloor$.
 8:     Set $w_t^i$ according to Eq. (5).
 9:     Pass training example $(x_t, y_t)$ to $\mathrm{WL}^i$ with probability $p_t^i = \frac{w_t^i}{\|w^i\|_\infty}$.
10:   end for
11: end for
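The pseudocode above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the weak learner interface (`predict(x)` returning ±1 and `update(x, y)`) is a hypothetical one, and the upper bound used for $\|w^i\|_\infty$ is half the largest binomial probability, which is one valid choice of the "tight-enough upper bound" mentioned in Section 2.

```python
import random
from math import comb


class OnlineBBM:
    """Sketch of Online BBM over weak learners exposing predict(x) and update(x, y)."""

    def __init__(self, weak_learners, gamma, seed=0):
        self.wls = list(weak_learners)
        self.N = len(self.wls)
        self.p_heads = 0.5 + gamma / 2.0
        self.rng = random.Random(seed)
        # Assumed upper bound on ||w^i||_inf: half the largest binomial pmf value.
        self.w_max = [
            0.5 * max(self._pmf(self.N - i, k) for k in range(self.N - i + 1))
            for i in range(1, self.N + 1)
        ]

    def _pmf(self, n, k):
        if k < 0 or k > n:
            return 0.0
        return comb(n, k) * self.p_heads**k * (1.0 - self.p_heads) ** (n - k)

    def predict(self, x):
        # Unweighted majority vote (all alpha_t^i equal 1); sign(0) is taken as +1.
        votes = sum(wl.predict(x) for wl in self.wls)
        return 1 if votes >= 0 else -1

    def update(self, x, y):
        s = 0  # s_t^0
        for i, wl in enumerate(self.wls, start=1):
            z = y * wl.predict(x)            # z_t^i, computed before WL^i is updated
            k = (self.N - i - s + 1) // 2    # k_t^i, based on s_t^{i-1}
            w = 0.5 * self._pmf(self.N - i, k)
            s += z                           # s_t^i
            # Rejection sampling: pass the example with probability w / bound.
            if self.rng.random() < w / self.w_max[i - 1]:
                wl.update(x, y)
```

A weak learner that accepts importance weights could instead be given the ratio `w / self.w_max[i - 1]` as the weight of the example, as discussed in Section 2.1.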


Lemma 4. If $w_t^i$ is defined as in Eq. (5), then we have $w_t^i = O(1/\sqrt{N - i})$ for any $i < N$.

This lemma was essentially proven before by Freund [1993, Lemma 2.3.10]. We give an
alternative and simpler proof in Appendix B in the supplementary material by using the Berry-Esseen
theorem directly. We are now ready to state the main results of Online BBM.


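Theorem 2. For any $T$ and $N$, with high probability, the number of mistakes made by the Online
BBM algorithm is bounded as follows:

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \exp\left(-\tfrac{1}{2} N \gamma^2\right) T + \tilde{O}\left(\sqrt{N} \left(S + \tfrac{1}{\gamma}\right)\right). \tag{6}$$

Thus, in order to achieve error rate $\epsilon$, it suffices to use $N = O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$ weak learners, which gives
an excess loss bound of $\tilde{O}(\frac{S}{\gamma} + \frac{1}{\gamma^2})$.

Proof. A direct application of Hoeffding’s inequality gives $\Phi_0(0) \le \exp(-N\gamma^2/2)$. With Lemma 4
and the trivial bound $w_t^N \le 1/2$ we have

$$\sum_{i=1}^{N} \|w^i\|_\infty = O\left(1 + \sum_{i=1}^{N-1} \frac{1}{\sqrt{N - i}}\right) = O(\sqrt{N}).$$

Applying Lemma 3 proves Eq. (6). Now if we set $N = \frac{2}{\gamma^2} \ln \frac{1}{\epsilon}$, then

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \epsilon T + \tilde{O}\left(\sqrt{N} \left(S + \tfrac{1}{\gamma}\right)\right) = \epsilon T + \tilde{O}\left(\tfrac{S}{\gamma} + \tfrac{1}{\gamma^2}\right).$$

□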
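As a small numerical illustration of the choice of $N$ in the proof, the sketch below (the function name is hypothetical) returns the smallest integer $N$ for which the leading term $\exp(-N\gamma^2/2)$ of Eq. (6) is at most $\epsilon$.

```python
from math import ceil, log


def num_weak_learners(gamma: float, eps: float) -> int:
    """Smallest integer N with exp(-N * gamma**2 / 2) <= eps."""
    return ceil(2.0 * log(1.0 / eps) / gamma**2)


# Example: edge 0.1 and target error rate 1%.
N = num_weak_learners(gamma=0.1, eps=0.01)
```

For $\gamma = 0.1$ and $\epsilon = 0.01$ this gives $N = 922$, which shows how quickly the number of weak learners grows as the edge shrinks.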


3.2 Matching Lower Bounds 

We give lower bounds for the number of weak learners and the sample complexity in this section 
that show that our Online BBM algorithm is optimal up to logarithmic factors. 



















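Theorem 3. For any $\gamma \in (0, \frac{1}{4})$, $S \ge \frac{\ln(1/\delta)}{\gamma}$, $\delta \in (0, 1)$ and $\epsilon \in (0, 1)$, there is a weak online learning
algorithm with edge $\gamma$ and excess loss $S$ satisfying (1) with probability at least $1 - \delta$, such that to
achieve error rate $\epsilon$, an online boosting algorithm needs at least $\Omega(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$ weak learners and a
sample complexity of $\Omega(\frac{S}{\epsilon \gamma}) = \Omega(\frac{1}{\epsilon}(\frac{S}{\gamma} + \frac{1}{\gamma^2}))$.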

Proof. The proofs of both lower bounds use a similar construction. In either case, all examples’
labels are generated uniformly at random from $\{-1, 1\}$, and in time period $t$, each weak learner
outputs the correct label $y_t$ independently of all other weak learners and other examples with a
certain probability $p_t$ to be specified later. Thus, for any $T$, by the Azuma-Hoeffding inequality,
with probability at least $1 - \delta$, the predictions $\hat{y}_t$ made by the weak learner satisfy

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \sum_{t=1}^{T} (1 - p_t) + \sqrt{2T \ln(\tfrac{1}{\delta})} \le \sum_{t=1}^{T} (1 - p_t) + \gamma T + \frac{\ln(1/\delta)}{2\gamma}, \tag{7}$$

where the last inequality follows by the arithmetic mean-geometric mean inequality. We will now
carefully choose $p_t$ so that inequality (7) implies inequality (1).

For the lower bound on the number of weak learners, we set $p_t = \frac{1}{2} + 2\gamma$, so that inequality (7)
implies that with probability at least $1 - \delta$, the predictions $\hat{y}_t$ made by the weak learner satisfy

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} - \gamma\right) T + \frac{\ln(1/\delta)}{2\gamma} \le \left(\tfrac{1}{2} - \gamma\right) T + S.$$

Thus, the weak online learner has edge $\gamma$ with excess loss $S$. In this case, the Bayes optimal output
of a booster using $N$ weak learners is to simply take a majority vote of all the weak learners [see
for instance Schapire and Freund, 2012, Chap. 13.2.6], and the probability that the majority vote
is incorrect is $\Theta(\exp(-8N\gamma^2))$. Setting this error to $\epsilon$ and solving for $N$ gives the desired lower
bound.

Now we turn to the lower bound on the sample complexity. We divide the whole process into
two phases: for $t \le T_0 = \frac{S}{4\gamma}$, we set $p_t = \frac{1}{2}$, and for $t > T_0$, we set $p_t = \frac{1}{2} + 2\gamma$. Now, if $T \le T_0$,
inequality (7) implies that with probability at least $1 - \delta$, the predictions $\hat{y}_t$ made by the weak
learner satisfy

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} + \gamma\right) T + \frac{\ln(1/\delta)}{2\gamma} \le \left(\tfrac{1}{2} - \gamma\right) T + S \tag{8}$$

using the fact that $T \le T_0 = \frac{S}{4\gamma}$ and $S \ge \frac{\ln(1/\delta)}{\gamma}$. Next, if $T > T_0$, let $T' = T - T_0$, and again
inequality (7) implies that with probability at least $1 - \delta$, the predictions $\hat{y}_t$ made by the weak
learner satisfy

$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \tfrac{1}{2} T_0 + \left(\tfrac{1}{2} - 2\gamma\right) T' + \gamma T + \frac{\ln(1/\delta)}{2\gamma} = \left(\tfrac{1}{2} - \gamma\right) T + 2\gamma T_0 + \frac{\ln(1/\delta)}{2\gamma} \le \left(\tfrac{1}{2} - \gamma\right) T + S, \tag{9}$$

since $S \ge \frac{\ln(1/\delta)}{\gamma}$. Inequalities (8) and (9) imply that the weak online learner has edge $\gamma$ with excess
loss $S$.

However, in the first phase (i.e. $t \le T_0$), since the predictions of the weak learners are
uncorrelated with the true labels, it is clear that no matter what the booster does, it makes a mistake
with probability $\frac{1}{2}$. Thus, it will make $\Omega(T_0)$ mistakes with high probability in the first phase, and
thus to achieve $\epsilon$ error rate, it needs at least $\Omega(T_0/\epsilon) = \Omega(\frac{S}{\epsilon \gamma})$ examples. □


4 An Adaptive Algorithm


Although the Online BBM algorithm is optimal, it is unfortunately not adaptive since it requires the
knowledge of $\gamma$ as a parameter, which is unknown ahead of time. As discussed in the introduction,
adaptivity is essential to the practical performance of boosting algorithms such as AdaBoost.

In this section we thus study adaptive online boosting algorithms using the theory of online
loss minimization as the main tool. It is known that boosting can be viewed as trying to find a
linear combination of weak hypotheses to minimize the total loss of the training examples, usually
using functional gradient descent [see for details Schapire and Freund, 2012, Chap. 7]. AdaBoost,
for instance, minimizes the exponential loss. Here, as discussed before, we intuitively want to
avoid using exponential loss since it could lead to large weights. Instead, we will consider logistic
loss $\ell(s) = \ln(1 + \exp(-s))$, which results in an algorithm called AdaBoost.L in the batch setting
[Schapire and Freund, 2012, Chap. 7].

In the online setting, we conceptually define $N$ different “experts” giving advice on what to
predict on the current example $x_t$. In round $t$, expert $i$ predicts by combining the first $i$ weak
learners: $\hat{y}_t^i = \mathrm{sign}(\sum_{j=1}^{i} \alpha_t^j \mathrm{WL}^j(x_t))$. Now as in AdaBoost.L, the weight $w_t^i$ for $\mathrm{WL}^i$ is obtained
by computing the logistic loss of the prediction of expert $i - 1$, i.e. $\ell(s_t^{i-1})$, and then setting $w_t^i$ to
be the negative derivative of the loss:

$$w_t^i = -\ell'(s_t^{i-1}) = \frac{1}{1 + \exp(s_t^{i-1})} \in [0, 1].$$

In terms of the weight of $\mathrm{WL}^i$, i.e. $\alpha_t^i$, ideally we wish to mimic AdaBoost.L and use a fixed $\alpha^i$
for all $t$ such that the total logistic loss is minimized: $\alpha^i = \arg\min_{\alpha} \sum_{t=1}^{T} \ell(s_t^{i-1} + \alpha z_t^i)$. Of course
this is not possible because $\alpha^i$ depends on the future unknown examples. Nevertheless, it turns
out that we can almost achieve that using tools from online learning theory. Indeed, one of the
fundamental topics in online learning is exactly how to perform almost as well as the best fixed
choice ($\alpha^i$) in hindsight.

Specifically, it turns out that it suffices to restrict $\alpha$ to the feasible set $[-2, 2]$. Then consider
the following simple one dimensional online learning problem: on each round $t$, the algorithm predicts
$\alpha_t^i$ from the feasible set $[-2, 2]$; the environment then reveals the loss function $f_t^i(\alpha) = \ell(s_t^{i-1} + \alpha z_t^i)$ and
the algorithm suffers loss $f_t^i(\alpha_t^i)$. There are many so-called “low-regret” algorithms in the literature
(see the survey by Shalev-Shwartz [2011]) for this problem ensuring

$$\sum_{t=1}^{T} f_t^i(\alpha_t^i) - \min_{\alpha \in [-2, 2]} \sum_{t=1}^{T} f_t^i(\alpha) \le R_T^i,$$

where $R_T^i$ is sublinear in $T$ so that on average it goes to 0 when $T$ is large and the algorithm is
thus doing almost as well as the best constant choice $\alpha^i$. The simplest low-regret algorithm in this
case is perhaps online gradient descent [Zinkevich, 2003]:

$$\alpha_{t+1}^i = \Pi\left(\alpha_t^i - \eta_t {f_t^i}'(\alpha_t^i)\right) = \Pi\left(\alpha_t^i + \frac{\eta_t z_t^i}{1 + \exp(s_t^i)}\right),$$

where $\eta_t$ is a time-varying learning rate and $\Pi$ represents projection onto the set $[-2, 2]$, i.e.,
$\Pi(\cdot) = \max\{-2, \min\{2, \cdot\}\}$. Since the loss function is actually 1-Lipschitz ($|{f_t^i}'(\alpha)| \le 1$), if we set $\eta_t$
to be $4/\sqrt{t}$, then standard analysis shows $R_T^i = 4\sqrt{T}$.
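The update rule can be written out as a short sketch. The function names below are hypothetical; the step implements projected online gradient descent on $f_t^i(\alpha) = \ln(1 + \exp(-(s_t^{i-1} + \alpha z_t^i)))$ with the learning rate $\eta_t = 4/\sqrt{t}$ stated above.

```python
from math import exp, sqrt


def project(alpha: float, lo: float = -2.0, hi: float = 2.0) -> float:
    """Projection onto the feasible interval [lo, hi]."""
    return max(lo, min(hi, alpha))


def ogd_step(alpha: float, s_prev: float, z: int, t: int) -> float:
    """One projected gradient step; returns alpha_{t+1}^i.

    alpha is alpha_t^i, s_prev is s_t^{i-1}, z is z_t^i in {-1, +1}, t >= 1.
    """
    s = s_prev + alpha * z              # s_t^i
    grad = -z / (1.0 + exp(s))          # derivative of the logistic loss in alpha
    eta = 4.0 / sqrt(t)
    return project(alpha - eta * grad)
```

Starting from $\alpha = 0$ at $t = 1$ with $s_t^{i-1} = 0$, a correct weak prediction ($z = 1$) moves $\alpha$ to the boundary value 2 in a single step, and an incorrect one ($z = -1$) moves it to $-2$.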
Algorithm 2 AdaBoost.OL
 1: Initialize: $\forall i : v_1^i = 1, \alpha_1^i = 0$.
 2: for $t = 1$ to $T$ do
 3:   Receive example $x_t$.
 4:   for $i = 1$ to $N$ do
 5:     Set $\hat{y}_t^i = \mathrm{sign}(\sum_{j=1}^{i} \alpha_t^j \mathrm{WL}^j(x_t))$.
 6:   end for
 7:   Randomly pick $i_t$ with $\Pr[i_t = i] \propto v_t^i$.
 8:   Predict $\hat{y}_t = \hat{y}_t^{i_t}$, receive label $y_t$.
 9:   Set $s_t^0 = 0$.
10:   for $i = 1$ to $N$ do
11:     Set $z_t^i = y_t \mathrm{WL}^i(x_t)$.
12:     Set $s_t^i = s_t^{i-1} + \alpha_t^i z_t^i$.
13:     Set $\alpha_{t+1}^i = \Pi\left(\alpha_t^i + \frac{\eta_t z_t^i}{1 + \exp(s_t^i)}\right)$ with $\eta_t = 4/\sqrt{t}$.
14:     Pass example $(x_t, y_t)$ to $\mathrm{WL}^i$ with probability⁴ $p_t^i = w_t^i = 1/(1 + \exp(s_t^{i-1}))$.
15:     Set $v_{t+1}^i = v_t^i \cdot \exp(-\mathbf{1}\{y_t \neq \hat{y}_t^i\})$.
16:   end for
17: end for
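A Python sketch of Algorithm 2 follows. It is illustrative rather than the authors' implementation: the weak learner interface (`predict(x)` returning ±1 and `update(x, y)`) is hypothetical, and the Hedge weights are rescaled after each round, an added numerical safeguard that leaves the sampling distribution of Line 7 unchanged.

```python
import random
from math import exp, sqrt


class AdaBoostOL:
    """Sketch of AdaBoost.OL over weak learners exposing predict(x) and update(x, y)."""

    def __init__(self, weak_learners, seed=0):
        self.wls = list(weak_learners)
        self.N = len(self.wls)
        self.alpha = [0.0] * self.N  # alpha_t^i
        self.v = [1.0] * self.N      # Hedge weights v_t^i
        self.t = 0
        self.rng = random.Random(seed)

    def _expert_predictions(self, x):
        """Return the lists (WL^i(x))_i and (yhat_t^i)_i for i = 1..N."""
        preds, experts, score = [], [], 0.0
        for a, wl in zip(self.alpha, self.wls):
            h = wl.predict(x)
            score += a * h
            preds.append(h)
            experts.append(1 if score >= 0 else -1)
        return preds, experts

    def predict(self, x):
        _, experts = self._expert_predictions(x)
        i_t = self.rng.choices(range(self.N), weights=self.v)[0]
        return experts[i_t]

    def update(self, x, y):
        self.t += 1
        eta = 4.0 / sqrt(self.t)
        preds, experts = self._expert_predictions(x)
        s = 0.0  # s_t^0
        for i, wl in enumerate(self.wls):
            z = y * preds[i]
            w = 1.0 / (1.0 + exp(s))                 # w_t^i, based on s_t^{i-1}
            s += self.alpha[i] * z                   # s_t^i
            step = self.alpha[i] + eta * z / (1.0 + exp(s))
            self.alpha[i] = max(-2.0, min(2.0, step))
            if self.rng.random() < w:                # sampling probability p_t^i = w_t^i
                wl.update(x, y)
            if experts[i] != y:
                self.v[i] *= exp(-1.0)
        # Rescale so the largest weight is 1 (avoids underflow on long runs).
        top = max(self.v)
        self.v = [u / top for u in self.v]
```

With a weak learner that is always wrong, the learned weight becomes negative, so the combined prediction is still correct; this matches the remark below that $\gamma_i$ may be negative.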


Finally, it remains to specify the algorithm’s final prediction $\hat{y}_t$. In Online BBM, we simply used
the advice of expert $N$. Unfortunately the algorithm described in this section cannot guarantee
that expert $N$ will always make highly accurate predictions. However, as we will show in the proof
of Theorem 4, the algorithm does ensure that at least one of the $N$ experts will have high accuracy.
Therefore, what we really need to do is to decide which expert to follow on each round, and try
to predict almost as well as the best fixed expert in hindsight. This is again another classic
online learning problem (called the expert or hedge problem), and can be solved, for instance, by the
well-known Hedge algorithm [Littlestone and Warmuth, 1994, Freund and Schapire, 1997]. The
idea is to pick an expert on each round randomly with different importance weights according to
their previous performance.

We call the final resulting algorithm AdaBoost.OL (O stands for online and L stands for logistic
loss), and summarize it in Algorithm 2. Note that as promised, AdaBoost.OL is an adaptive online
boosting algorithm and does not require knowing $\gamma$ in advance. In fact, in the analysis we do not
even assume that the weak learners satisfy the bound (1). Instead, define the quantities $\gamma_i = \frac{w^i \cdot z^i}{2 \|w^i\|_1}$
for each weak learner $\mathrm{WL}^i$. This can be interpreted as the (weighted) edge over random guessing
that $\mathrm{WL}^i$ obtains. Note that $\gamma_i$ may even be negative, which means flipping the sign of $\mathrm{WL}^i$’s
predictions performs better than random guessing. Nevertheless, the algorithm can still make
accurate predictions even with negative $\gamma_i$ since it will end up choosing negative weights $\alpha_t^i$ in that
case. The performance of AdaBoost.OL is provided below.


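Theorem 4. For any $T$ and $N$, with high probability, the number of mistakes made by AdaBoost.OL
is bounded by

$$\frac{2T}{\sum_i \gamma_i^2} + \tilde{O}\left(\frac{N^2}{\sum_i \gamma_i^2}\right).$$

⁴Note that we are using the bound $\|w^i\|_\infty \le 1$ here.

Proof. Let the number of mistakes made by expert $i$ be $M_i = \sum_{t=1}^{T} \mathbf{1}\{y_t \neq \hat{y}_t^i\}$; also define $M_0 = T$
for convenience. Note that AdaBoost.OL is using a variant of the Hedge algorithm with $\mathbf{1}\{y_t \neq \hat{y}_t^i\}$
being the loss of expert $i$ on round $t$ (Lines 7 and 15). So by standard analysis [see e.g.
Cesa-Bianchi and Lugosi, 2006, Corollary 2.3], and the Azuma-Hoeffding inequality, we have with
high probability

$$\sum_{t=1}^{T} \mathbf{1}\{y_t \neq \hat{y}_t\} \le 2 \min_i M_i + 2 \ln(N) + \tilde{O}(\sqrt{T}). \tag{10}$$

Now, whenever expert $i - 1$ makes a mistake (i.e. $s_t^{i-1} \le 0$), we have $w_t^i = 1/(1 + \exp(s_t^{i-1})) \ge 1/2$
and therefore

$$\|w^i\|_1 \ge M_{i-1}/2. \tag{11}$$

Note that Eq. (11) holds even for $i = 1$ by the definition of $M_0$. We now bound the difference
between the logistic loss of two successive experts, $\Delta_i = \sum_{t=1}^{T} (\ell(s_t^i) - \ell(s_t^{i-1}))$. Online gradient
descent (Line 13) ensures that

$$\sum_{t=1}^{T} \ell(s_t^i) \le \min_{\alpha \in [-2, 2]} \sum_{t=1}^{T} \ell(s_t^{i-1} + \alpha z_t^i) + 4\sqrt{T}, \tag{12}$$

as discussed previously. On the other hand, direct calculation shows $\ell(s_t^{i-1} + \alpha z_t^i) - \ell(s_t^{i-1}) =
\ln(1 + w_t^i (e^{-\alpha z_t^i} - 1)) \le w_t^i (e^{-\alpha z_t^i} - 1)$. With $\sigma_i = \sum_{t=1}^{T} \frac{w_t^i}{\|w^i\|_1} \mathbf{1}\{z_t^i = 1\} = \frac{1}{2} + \gamma_i$, we have

$$\min_{\alpha \in [-2, 2]} \sum_{t=1}^{T} \left(\ell(s_t^{i-1} + \alpha z_t^i) - \ell(s_t^{i-1})\right) \le \min_{\alpha \in [-2, 2]} \|w^i\|_1 \left(\sigma_i e^{-\alpha} + (1 - \sigma_i) e^{\alpha} - 1\right)$$
$$\le -\tfrac{1}{2} \|w^i\|_1 (2\sigma_i - 1)^2 \tag{13}$$
$$= -2 \gamma_i^2 \|w^i\|_1 \tag{14}$$
$$\le -\gamma_i^2 M_{i-1}. \tag{15}$$

Here, inequality (13) follows from Lemma 5 and inequality (15) from inequality (11). The above
inequality and inequality (12) imply that

$$\Delta_i \le -\gamma_i^2 M_{i-1} + 4\sqrt{T}.$$

Summing over $i = 1, \ldots, N$ and rearranging gives

$$\sum_{i=1}^{N} \gamma_i^2 M_{i-1} + \sum_{t=1}^{T} \ell(s_t^N) \le \sum_{t=1}^{T} \ell(0) + 4N\sqrt{T},$$

which implies that

$$\min_i M_i \le \min_i M_{i-1} \le \frac{\ln(2)\, T}{\sum_i \gamma_i^2} + \frac{4N\sqrt{T}}{\sum_i \gamma_i^2}$$

since $M_i \le M_0$ for all $i$, $\ell(s_t^N) \ge 0$ for all $t$ and $\ell(0) = \ln(2)$. Using this bound in inequality (10),
we get

$$\sum_{t=1}^{T} \mathbf{1}\{y_t \neq \hat{y}_t\} \le \frac{2 \ln(2)\, T}{\sum_i \gamma_i^2} + \tilde{O}\left(\sqrt{T} + \frac{N\sqrt{T}}{\sum_i \gamma_i^2} + \ln(N)\right) \le \frac{2T}{\sum_i \gamma_i^2} + \tilde{O}\left(\frac{N^2}{\sum_i \gamma_i^2}\right),$$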
where the last inequality follows from the bound $c N \sqrt{T} \le 2(1 - \ln 2)\, T + \frac{c^2 N^2}{8(1 - \ln 2)}$, where $c$ is the hidden
$\tilde{O}(1)$ factor in the $\tilde{O}(\frac{N\sqrt{T}}{\sum_i \gamma_i^2})$ term, using the arithmetic mean-geometric mean inequality. □

For the case when the weak learners do satisfy the bound (1), we get the following bound on the number of errors:

Theorem 5. If the weak learners satisfy (1), then for any $T$ and $N$, with high probability, the number of mistakes made by AdaBoost.OL is bounded by

$$\frac{8T}{\gamma^2 N} + \tilde{O}\left(\frac{N}{\gamma^2} + \frac{S}{\gamma}\right).$$

Thus, in order to achieve error rate $\epsilon$, it suffices to use $N \ge \frac{8}{\epsilon \gamma^2}$ weak learners, which gives an excess loss bound of $\tilde{O}\left(\frac{S}{\gamma} + \frac{1}{\epsilon \gamma^4}\right)$.

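Proof. The proof is along the same lines as that of Theorem 4. The only change is that in inequality (15), we use the bound $\gamma_i^2 \ge \frac{\gamma^2}{4} - \frac{\gamma S + \tilde{O}(1)}{\|\mathbf{w}^i\|_1}$, which follows from Lemma 1 using the fact that $a \ge b - c$ implies $a^2 \ge b^2 - 2bc$ for non-negative $a$, $b$ and $c$, and the fact that $\|\mathbf{w}^i\|_\infty \le 1$. This leads to the following change in inequality (15):

$$\min_{\alpha \in [-2,2]} \sum_{t=1}^{T} \left(\ell(s_t^{i-1} + \alpha z_t^i) - \ell(s_t^{i-1})\right) \le -\frac{\gamma^2}{4} M_{i-1} + 2\gamma S + \tilde{O}(1).$$

Continuing using this bound in the proof and simplifying, we get the stated bound on the number of errors. □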

The following lemma is a simple calculation: 

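Lemma 5. For any $\sigma \in [0,1]$,

$$\min_{\alpha \in [-2,2]} \sigma e^{-\alpha} + (1 - \sigma) e^{\alpha} \le 1 - \tfrac{1}{2}(2\sigma - 1)^2.$$

Proof. It suffices to prove the bound for $\sigma \ge \frac{1}{2}$; the bound for $\sigma < \frac{1}{2}$ follows by simply using the bound for $1 - \sigma$. For $\sigma \in [0.5, 0.95]$, setting $\alpha = \frac{1}{2}\ln\left(\frac{\sigma}{1-\sigma}\right) \in [-2, 2]$ gives

$$\sigma e^{-\alpha} + (1 - \sigma) e^{\alpha} = \sqrt{4\sigma(1 - \sigma)} \le 1 - \tfrac{1}{2}(2\sigma - 1)^2,$$

since $\sqrt{1 - x} \le 1 - \frac{x}{2}$ for $x \in [0,1]$. For $\sigma \in (0.95, 1]$, setting $\alpha = \frac{1}{2}\ln\left(\frac{0.95}{0.05}\right) \in [-2, 2]$ we have

$$\sigma e^{-\alpha} + (1 - \sigma) e^{\alpha} \le 0.95 e^{-\alpha} + 0.05 e^{\alpha} = \sqrt{0.19} < \tfrac{1}{2} \le 1 - \tfrac{1}{2}(2\sigma - 1)^2.$$

□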
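Lemma 5 is easy to check numerically. The sketch below approximates the minimum over $\alpha \in [-2, 2]$ on a uniform grid and compares it with $1 - \frac{1}{2}(2\sigma - 1)^2$ for a range of $\sigma$. The grid resolution, the tolerance and the function names are assumptions made for illustration; a grid minimum slightly overestimates the true minimum, which is why a small tolerance is allowed.

```python
import math


def lemma5_lhs(sigma: float, grid: int = 2001) -> float:
    """Grid approximation of min over alpha in [-2, 2] of
    sigma * exp(-alpha) + (1 - sigma) * exp(alpha)."""
    best = float("inf")
    for k in range(grid):
        alpha = -2.0 + 4.0 * k / (grid - 1)
        value = sigma * math.exp(-alpha) + (1.0 - sigma) * math.exp(alpha)
        best = min(best, value)
    return best


def lemma5_rhs(sigma: float) -> float:
    """Right-hand side 1 - (1/2) * (2 * sigma - 1)^2."""
    return 1.0 - 0.5 * (2.0 * sigma - 1.0) ** 2


def lemma5_holds(steps: int = 100, tol: float = 1e-6) -> bool:
    """Check the inequality for sigma = 0, 1/steps, ..., 1."""
    return all(
        lemma5_lhs(j / steps) <= lemma5_rhs(j / steps) + tol
        for j in range(steps + 1)
    )
```

The bound is tight at $\sigma = 1/2$, where both sides equal 1, and increasingly loose as $\sigma$ moves towards 0 or 1.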

Although the number of weak learners and excess loss for AdaBoost.OL are suboptimal, the
adaptivity of AdaBoost.OL is an appealing feature and leads to good performance in experiments.
The possibility of obtaining an algorithm that is both adaptive and optimal is left as an open
question.
Table 2: Below, d is the number of unique features in the dataset, and s is the average number of
features per example.

Dataset         instances      s          d
20news             18,845   93.9    101,631
a9a                48,841   13.9        123
activity          165,632   18.5         20
adult              48,842   12.0        105
bio               145,750   73.4         74
census            299,284   32.0        401
covtype           581,011   11.9         54
letter             20,000   15.6         16
maptaskcoref      158,546   40.4      5,944
nomao              34,465   82.3        174
poker             946,799   10.0         10
rcv1              781,265   75.7     43,001
vehv2binary       299,254   48.6        105


5 Experiments 


While the focus of this paper is a theoretical investigation of online boosting, we have also performed 
experiments to evaluate our algorithms. 


We extended the Vowpal Wabbit open source machine learning system [VW] to include the
algorithms studied in this paper. We used VW's default base learning algorithm as our weak
learner, tuning only the learning rate. The online boosting algorithms implemented were Online
BBM, AdaBoost.OL, OSBoost (using uniform weighting on the weak learners) and OSBoost.OCP
from Chen et al. [2012], all using importance weighted examples in VW. We also implemented
AdaBoost.OL.S, which is the version of AdaBoost.OL where examples sent to VW are sampled
rather than weighted.

All experiments were done on a diverse collection of 13 publicly available datasets. The
datasets come from the UCI repository, KDD Cup challenges, and the HCRC Map Task Corpus.
A description of these datasets is given in Table 2. For each dataset, we performed a random split
with 80% of the data used for training and the remaining 20% for testing. We tuned the learning
rate, the number of weak learners, and the edge parameter $\gamma$ (for all but AdaBoost.OL) using
progressive validation 0-1 loss on the training set. Reported is the 0-1 loss on the test set.
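Progressive validation scores each training example with the current predictor before the learner updates on it, so the average loss over one online pass acts as a held-out estimate. The sketch below illustrates this under stated assumptions: the `Perceptron` class is a hypothetical stand-in for the base learner and does not reflect VW's interface, and labels are taken to be in {-1, +1}.

```python
from typing import Iterable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


class Perceptron:
    """Hypothetical online learner used only for illustration."""

    def __init__(self, dim: int, lr: float = 1.0) -> None:
        self.w: List[float] = [0.0] * dim
        self.lr = lr

    def predict(self, x: Sequence[float]) -> int:
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= 0 else -1

    def update(self, x: Sequence[float], y: int) -> None:
        # Mistake-driven update: move the weights only on an error.
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]


def progressive_validation_loss(learner: Perceptron,
                                stream: Iterable[Example]) -> float:
    """Average 0-1 loss where each example is predicted before it is learned."""
    mistakes = 0
    count = 0
    for x, y in stream:
        if learner.predict(x) != y:   # predict first ...
            mistakes += 1
        learner.update(x, y)          # ... then update on the same example
        count += 1
    return mistakes / count if count else 0.0
```

Hyperparameters such as the learning rate can then be compared by running this loop once per candidate setting on the training stream and keeping the setting with the lowest value.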

It should be noted that the VW baseline is already a strong learner. The results obtained are
given in Table 3. As can be seen, for most datasets, Online BBM had the best performance. The
average improvement of Online BBM over the baseline was 5.14%. For AdaBoost.OL, it was 2.57%.
Using sampling in AdaBoost.OL (i.e. AdaBoost.OL.S) boosts the average to 2.67%. The average
improvement for OSBoost.OCP was 1.98%, followed by OSBoost with 1.13%.
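The text does not spell out how improvement over the baseline is computed. One natural reading, assumed here, is the relative reduction in test 0-1 loss, averaged over datasets. The sketch below applies that assumed definition to three rows of Table 3 (VW baseline and Online BBM columns); it illustrates the calculation only and is not meant to reproduce the reported averages.

```python
from typing import Dict, Tuple


def relative_improvement(baseline: float, boosted: float) -> float:
    """Relative reduction in loss: (baseline - boosted) / baseline."""
    return (baseline - boosted) / baseline


# (VW baseline, Online BBM) test losses taken from Table 3.
ROWS: Dict[str, Tuple[float, float]] = {
    "covtype": (0.2563, 0.2347),
    "letter": (0.2295, 0.1923),
    "poker": (0.4555, 0.4312),
}


def average_improvement(rows: Dict[str, Tuple[float, float]]) -> float:
    """Unweighted mean of the per-dataset relative improvements."""
    values = [relative_improvement(base, boosted) for base, boosted in rows.values()]
    return sum(values) / len(values)
```

Under this definition a dataset where boosting leaves the loss unchanged contributes zero to the average, and every dataset counts equally regardless of its size.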
Table 3: Performance of various online boosting algorithms on various datasets. The lowest loss
attained for each dataset is bolded. The baseline is the loss obtained by running the weak learner,
VW, on the data.

Dataset        VW baseline   Online BBM   AdaBoost.OL   AdaBoost.OL.S   OSBoost.OCP   OSBoost
20news         [illegible]   0.0775       0.0777        [illegible]     [illegible]   0.0801
a9a            [illegible]   0.1495       0.1497        [illegible]     [illegible]   0.1505
activity       [illegible]   0.0114       0.0128        [illegible]     [illegible]   0.0133
adult          0.1543        0.1526       0.1536        [illegible]     0.1539        0.1544
bio            0.0035        0.0031       0.0032        0.0032          0.0033        0.0034
census         0.0471        0.0469       0.0469        0.0469          0.0469        0.0470
covtype        0.2563        0.2347       0.2495        0.2450          0.2470        0.2521
letter         0.2295        0.1923       0.2078        0.2078          0.2148        0.2150
maptaskcoref   0.1091        0.1077       0.1083        0.1083          0.1093        0.1091
nomao          0.0641        0.0627       0.0635        0.0635          0.0627        0.0633
poker          0.4555        0.4312       0.4555        0.4555          0.4555        0.4555
rcv1           0.0487        0.0485       0.0484        0.0484          0.0488        0.0488
vehv2binary    0.0292        0.0286       0.0291        0.0291          0.0284        0.0286
A Proof of Lemma 1

Proof. Fix a weak learner, say $\mathrm{WL}^i$. Let

$$U = \{t : (x_t, y_t) \text{ passed to } \mathrm{WL}^i\}.$$

Since inequality (1) holds even for adaptive adversaries, with high probability we have

$$\sum_{t=1}^{T} \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, \mathbf{1}\{t \in U\} \le \left(\tfrac{1}{2} - \gamma\right)|U| + S. \tag{16}$$

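Now fix the internal randomness of $\mathrm{WL}^i$. Note that $\mathbb{E}_t[\mathbf{1}\{t \in U\}] = p_t^i = \frac{w_t^i}{\|\mathbf{w}^i\|_\infty}$, where $\mathbb{E}_t[\cdot]$ is
the expectation conditioned on all the randomness of the booster until (and not including) round
$t$. Define $\sigma = \sum_{t=1}^{T} p_t^i$.

We now show using martingale concentration bounds that with high probability,

$$\sum_{t=1}^{T} \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, p_t^i \le \sum_{t=1}^{T} \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, \mathbf{1}\{t \in U\} + \tilde{O}(\sqrt{\sigma}) \tag{17}$$

and

$$|U| \le \sigma + \tilde{O}(\sqrt{\sigma}). \tag{18}$$

Here, the $\tilde{O}(\cdot)$ notation suppresses dependence on $\log\log(T)$.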

To prove inequality (17), consider the martingale difference sequence

$$X_t = \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, \mathbf{1}\{t \in U\} - \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, p_t^i.$$

Note that $|X_t| \le 1$, and the conditional variance satisfies

$$\mathrm{Var}_t[X_t \mid X_1, X_2, \ldots, X_{t-1}] \le p_t^i.$$

Then, by Lemma 2 of Bartlett et al. [2008], for any $\delta < 1/e$ and assuming $T \ge 4$, with probability
at least $1 - \log_2(T)\delta$, we have

$$\sum_{t=1}^{T} X_t \le 2 \max\left\{2\sqrt{\sigma},\ \sqrt{\ln(1/\delta)}\right\} \sqrt{\ln(1/\delta)} = \tilde{O}(\sqrt{\sigma}),$$

by choosing $\delta \ll \frac{1}{\log_2(T)}$. This implies inequality (17). Inequality (18) is proved similarly. Note
that these high probability bounds are conditioned on the internal randomness of $\mathrm{WL}^i$. By taking
an expectation of this conditional probability over the internal randomness of $\mathrm{WL}^i$, we conclude
that inequalities (17) and (18) hold with high probability unconditionally.

Via a union bound, inequalities (16), (17) and (18) all hold simultaneously with high probability,
which implies that

$$\sum_{t=1}^{T} \mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\}\, p_t^i \le \left(\tfrac{1}{2} - \gamma\right)\sigma + S + \tilde{O}(\sqrt{\sigma}). \tag{19}$$
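Using the facts that $p_t^i = \frac{w_t^i}{\|\mathbf{w}^i\|_\infty}$ and $\mathbf{1}\{\mathrm{WL}^i(x_t) \neq y_t\} = \frac{1 - z_t^i}{2}$, and simplifying, we get

$$\mathbf{w}^i \cdot \mathbf{z}^i \ge 2\gamma \|\mathbf{w}^i\|_1 - 2S \|\mathbf{w}^i\|_\infty - \tilde{O}\left(\sqrt{\|\mathbf{w}^i\|_1 \|\mathbf{w}^i\|_\infty}\right)$$

$$\ge 2\gamma \|\mathbf{w}^i\|_1 - 2S \|\mathbf{w}^i\|_\infty - \gamma \|\mathbf{w}^i\|_1 - \tilde{O}\left(\frac{\|\mathbf{w}^i\|_\infty}{\gamma}\right)$$

$$= \gamma \|\mathbf{w}^i\|_1 - 2S \|\mathbf{w}^i\|_\infty - \tilde{O}\left(\frac{\|\mathbf{w}^i\|_\infty}{\gamma}\right).$$

The second inequality above follows from the arithmetic mean-geometric mean inequality. This
gives us the desired bound. The high probability bound for all weak learners follows by taking a
union bound. □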
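The concentration of $|U|$ around $\sigma$ in inequality (18) can be illustrated with a short simulation. The sketch below makes a simplifying assumption that the proof does not: the pass probabilities are fixed in advance instead of being chosen adaptively by the booster, so it only shows the typical size of the deviation and does not replace the martingale argument. The function name is illustrative.

```python
import random
from typing import Sequence, Tuple


def simulate_pass_count(probs: Sequence[float], seed: int = 0) -> Tuple[int, float]:
    """Pass round t to the weak learner with probability probs[t].

    Returns (|U|, sigma), where |U| is the number of rounds passed and
    sigma is the sum of the pass probabilities.
    """
    rng = random.Random(seed)
    # random() lies in [0, 1), so p = 1 always passes and p = 0 never does.
    passed = sum(1 for p in probs if rng.random() < p)
    sigma = float(sum(probs))
    return passed, sigma
```

With 10,000 rounds and every probability equal to 0.3, $\sigma = 3000$ and the standard deviation of $|U|$ is about 46, so the observed count typically differs from $\sigma$ by a small multiple of $\sqrt{\sigma}$.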


B Proof of Lemma 


Proof. Let $X \sim B(m, p)$ be a binomial random variable where $m = N - i$ and $p = 1/2 + \gamma/2$. Also
let $q = 1 - p$ and $F_X$ be the CDF of $X$. By the definition of $w_t^i$, we have $w_t^i \le \frac{1}{2} \max_k \Pr\{X = k\}$.
We will approximate $X$ by a Gaussian random variable $G \sim N(mp, mpq)$ with density function $f$
and CDF $F_G$. Note that

$$\left|\Pr\{X = k\} - \int_{k-1}^{k} f(G)\, dG\right| = \left|\left(F_X(k) - F_X(k-1)\right) - \left(F_G(k) - F_G(k-1)\right)\right|$$

$$\le \left|F_X(k) - F_G(k)\right| + \left|F_X(k-1) - F_G(k-1)\right|.$$

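So by applying the Berry-Esseen theorem to the above two CDF differences between $X$ and $G$, we
arrive at

$$\left|\Pr\{X = k\} - \int_{k-1}^{k} f(G)\, dG\right| \le \frac{2C(p^2 + q^2)}{\sqrt{mpq}},$$

where $C$ is the universal constant stated in the Berry-Esseen theorem. It remains to point out that

$$\Pr\{X = k\} \le \int_{k-1}^{k} f(G)\, dG + \frac{2C(p^2 + q^2)}{\sqrt{mpq}}$$

$$\le \max_{G \in \mathbb{R}} f(G) + \frac{2C(p^2 + q^2)}{\sqrt{mpq}}$$

$$= \frac{1}{\sqrt{2\pi mpq}} + \frac{2C(p^2 + q^2)}{\sqrt{mpq}} = O\left(\frac{1}{\sqrt{m}}\right)$$

since $pq = 1/4 - \gamma^2/4 \ge 3/16$.

□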
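The $O(1/\sqrt{m})$ behaviour of the largest binomial point mass can be observed directly. The sketch below evaluates $\max_k \Pr\{X = k\}$ in log-space for $X \sim B(m, p)$ with $p = 1/2 + \gamma/2$ and compares it with the Gaussian peak $1/\sqrt{2\pi mpq}$. The function names and the chosen values of $m$ and $\gamma$ are illustrative assumptions, and the check does not estimate the Berry-Esseen constant $C$.

```python
import math
from typing import List, Sequence


def max_binomial_pmf(m: int, p: float) -> float:
    """Largest point mass max_k Pr{X = k} for X ~ B(m, p), with 0 < p < 1."""
    log_p, log_q = math.log(p), math.log(1.0 - p)
    best = -math.inf
    for k in range(m + 1):
        # log of C(m, k) * p^k * q^(m - k), computed via lgamma for stability
        log_pmf = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                   + k * log_p + (m - k) * log_q)
        best = max(best, log_pmf)
    return math.exp(best)


def gaussian_peak(m: int, p: float) -> float:
    """Maximum density of N(mp, mpq), i.e. 1 / sqrt(2 * pi * m * p * q)."""
    return 1.0 / math.sqrt(2.0 * math.pi * m * p * (1.0 - p))


def scaled_peaks(gamma: float, ms: Sequence[int] = (10, 100, 1000)) -> List[float]:
    """Return sqrt(m) * max_k Pr{X = k} for each m, with p = 1/2 + gamma/2."""
    p = 0.5 + gamma / 2.0
    return [max_binomial_pmf(m, p) * math.sqrt(m) for m in ms]
```

For $\gamma \le 1/2$ the scaled values stay bounded as $m$ grows, which is the content of the bound above.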